Data exploration in SQL and R on the Babynames data set.
(Also, an exploration in ioslides_presentation in RMarkdown) This is still a work in progress, and due to frame limitations, I can't show everything.
10/9/2018
Data exploration in SQL and R on the Babynames data set.
(Also, an exploration in ioslides_presentation in RMarkdown) This is still a work in progress, and due to frame limitations, I can't show everything.
Before loading all of the data into MariaDB, I had to merge it all together and make a column for year. I accomplished this in the following cell. Source: Stack Overflow
path <-
'/Users/brennandonnell/grad_school/data900/sql/names'
file_names <-
dir(path,
pattern = "yob" )
df <- do.call(
rbind,
lapply(file_names, function(x) cbind(fread(
x, sep=',',
col.names = c('babyname','m/f','frequency'),
colClasses = list(character=1:2, numeric=3)),
year=strsplit(strsplit(x,'yob')[[1]][[2]], '.txt')[[1]])))
df$length <- str_length(unlist(df[[1]]))
write.csv(df, paste0(path,'/fullnames.csv'), sep=',')
Loading Mariadb into mySQL and R. This uses several libraries:
con <- dbConnect( drv = RMariaDB::MariaDB(), dbname = 'Babynames', username = uname, password = pword )
this is just the framework for how to create a table in SQL. How I actually did it was by importing the full CSV into Sequel Pro through file -> import
CREATE TABLE demonoo ( baby_name CHAR(15), gender CHAR(1), frequency INT(11), year INT )
For Query 1, we searched for the most popular male name and female name for that person’s year of birth. The following cell is a small sample of it.
rs <- dbSendQuery(con, "SELECT babyname,gender, MAX(frequency), year FROM newnames GROUP BY gender, year ORDER BY year;") head(dbFetch(rs))
## babyname gender MAX(frequency) year ## 1 Mary F 7065 1880 ## 2 John M 9655 1880 ## 3 Mary F 6919 1881 ## 4 John M 8769 1881 ## 5 Mary F 8148 1882 ## 6 John M 9557 1882
dbClearResult(rs) #dbDisconnect(con)
It is possible in R Markdown to create chunks of SQL code. This is the same code as the previous slide, just done directly in SQL.
SELECT babyname,gender, MAX(frequency), year FROM newnames GROUP BY gender, year ORDER BY year;
| babyname | gender | MAX(frequency) | year |
|---|---|---|---|
| Mary | F | 7065 | 1880 |
| John | M | 9655 | 1880 |
| Mary | F | 6919 | 1881 |
For Query 2, we searched for the popularity of a given person’s name for each year from 1879 to 2017.
Naturally, I picked my own perfect name, to see how many people have the same name as mine.
rs <- dbSendQuery(con, "SELECT babyname, gender, frequency, year FROM newnames WHERE babyname = 'brennan' AND gender = 'm' ORDER BY year;") tail(dbFetch(rs), n=3)
## babyname gender frequency year ## 67 Brennan M 687 2015 ## 68 Brennan M 548 2016 ## 69 Brennan M 458 2017
dbClearResult(rs)
Again, R Notebooks and R Markdown have the ability to run SQL code. This is the same code as the cell above.
SELECT babyname, gender, frequency, year FROM newnames WHERE babyname = 'brennan' AND gender = 'm' ORDER BY year;
| babyname | gender | frequency | year |
|---|---|---|---|
| Brennan | M | 5 | 1916 |
| Brennan | M | 6 | 1949 |
| Brennan | M | 6 | 1950 |
For the following trick, we shall examine some visualizations. But first, we have to group the data.
by_brennan <- df %>% group_by(year) %>% filter(babyname == "Brennan")
look at name lengths
by_length <- df %>% filter(length >= 10) %>% group_by(year) %>% summarise(n = n())